A startup idea is under consideration: an online store selling food. Using the available event log of the store's mobile application, we need to analyze how users behave while purchasing goods.
Main tasks:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import scipy.stats as st
import numpy as np
def mf_diff(ind):
    """ Count the number of unique users before and after data clipping, broken down by the given attribute """
id_est = (
df_data.pivot_table(index=ind, aggfunc={'id':'nunique'})
.join(df_data2.pivot_table(index=ind, aggfunc={'id':'nunique'}), lsuffix='_before',rsuffix='_after')
)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to add a totals row
    id_est = pd.concat([id_est, id_est.sum(axis=0).rename('Total').to_frame().T])
id_est['diff']=id_est.id_before-id_est.id_after
id_est['lost_share_proc'] = id_est.apply(lambda x: round(x['diff']*100/x['id_before'], 1) , axis=1)
id_est.index.name=ind
return id_est
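The pivot/join core of `mf_diff` can be illustrated on toy data. The frames below are hypothetical stand-ins for the notebook's `df_data`/`df_data2` globals that the function relies on:

```python
import pandas as pd

# toy "raw" log and its clipped version (hypothetical data)
df_data = pd.DataFrame({
    'event': ['A', 'A', 'B', 'B', 'B'],
    'id':    [1,   2,   2,   3,   4],
})
df_data2 = df_data.iloc[:3]  # pretend clipping dropped the last two rows

# count unique users per event before and after clipping, side by side
id_est = (
    df_data.pivot_table(index='event', aggfunc={'id': 'nunique'})
    .join(df_data2.pivot_table(index='event', aggfunc={'id': 'nunique'}),
          lsuffix='_before', rsuffix='_after')
)
id_est['diff'] = id_est.id_before - id_est.id_after
print(id_est)
```

Event `B` loses two of its three unique users in the clipped frame, while `A` keeps both, which is exactly the per-attribute loss `mf_diff` reports.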
def mf_bar(df, x_, y_, title, x_text, y_text):
    """ Build a bar chart """
fig=px.bar(df, x=x_, y=y_, text_auto=True)
fig.update_layout(height=500, width=800, title_text=title)
fig.update_xaxes(title_text=x_text)
fig.update_yaxes(title_text=y_text)
fig.show()
def mf_z_test(goal_list, all_list, alpha=0.05):
    """ Two-proportion z-test: p-value for the difference between two conversion shares """
    goal = np.array(goal_list)  # successes (conversion numerators)
    alll = np.array(all_list)   # totals (conversion denominators)
p1 = goal[0]/alll[0]
p2 = goal[1]/alll[1]
p_combined = (goal[0]+ goal[1])/(alll[0] + alll[1])
difference = p1 - p2
z_value = difference / ((p_combined * (1 - p_combined) * (1 / alll[0] + 1 / alll[1]))**0.5)
distr = st.norm(0, 1)
p_value = (1 - distr.cdf(abs(z_value))) * 2
if p_value < alpha:
result = 'H1'
else:
result = 'H0'
return [p_value, alpha, result]
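To sanity-check the formula, the same computation can be run inline on a pair of counts. The numbers below are taken from the funnel tables later in this notebook (PaymentScreenSuccessful, groups 246 vs 248), and the result reproduces the reported p-value of about 21.7%:

```python
import numpy as np
import scipy.stats as st

# successes and totals for the two compared groups
goal = np.array([1200, 1181])
alll = np.array([2483, 2535])

p1, p2 = goal / alll                         # per-group conversion shares
p_combined = goal.sum() / alll.sum()         # pooled share under H0
z_value = (p1 - p2) / np.sqrt(
    p_combined * (1 - p_combined) * (1 / alll[0] + 1 / alll[1])
)
p_value = (1 - st.norm(0, 1).cdf(abs(z_value))) * 2  # two-sided test
print(f"p-value: {p_value:.1%}")  # ≈ 21.7%, not significant at alpha = 0.05
```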
df_data = pd.read_csv('data.csv', sep="\t")
df_data.columns = [x.lower() for x in df_data.columns]
df_data.info()
print(f"\nThe amount of occupied dataset memory: {(df_data.memory_usage('deep')/(1024**2)).sum():.1f} Mb\n")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   eventname       244126 non-null  object
 1   deviceidhash    244126 non-null  int64 
 2   eventtimestamp  244126 non-null  int64 
 3   expid           244126 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 7.5+ MB

The amount of occupied dataset memory: 7.5 Mb
# renaming columns
df_data.rename({'eventname':'event', 'deviceidhash':'id', 'eventtimestamp':'ts', 'expid':'gr'}, axis=1, inplace=True)
df_data.head()
|   | event | id | ts | gr |
|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 1564029816 | 246 |
| 1 | MainScreenAppear | 7416695313311560658 | 1564053102 | 246 |
| 2 | PaymentScreenSuccessful | 3518123091307005509 | 1564054127 | 248 |
| 3 | CartScreenAppear | 3518123091307005509 | 1564054127 | 248 |
| 4 | PaymentScreenSuccessful | 6217807653094995999 | 1564055322 | 248 |
# analysis of group column values
df_data.gr.value_counts()
248    85747
246    80304
247    78075
Name: gr, dtype: int64
# checking for fully duplicated rows
print(f"Number of fully duplicated rows: {df_data.duplicated().sum()}")
Number of fully duplicated rows: 413
df_data.drop_duplicates(inplace=True)
print(f"Number of fully duplicated rows: {df_data.duplicated().sum()}")
Number of fully duplicated rows: 0
# the date of the event is converted to the date-time type
df_data.ts = pd.to_datetime(df_data.ts, unit='s')
# the index of the group is converted to an integer
df_data.gr = pd.to_numeric(df_data.gr, downcast='unsigned')
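The effect of both conversions can be checked on a minimal hypothetical sample (the timestamp is the one from the first row of the log):

```python
import pandas as pd

# unix seconds -> datetime64
ts = pd.to_datetime(pd.Series([1564029816]), unit='s')
print(ts[0])  # 2019-07-25 04:43:36

# the group ids all fit into uint8 (<= 255), so downcasting
# shrinks each value from 8 bytes (int64) to 1 byte
gr = pd.to_numeric(pd.Series([246, 247, 248]), downcast='unsigned')
print(gr.dtype)  # uint8
```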
# the final structure of the dataframe after conversion
print(f"\nThe amount of occupied dataset memory after conversion: {(df_data.memory_usage('deep')/(1024**2)).sum():.1f} Mb\n")
df_data.info()
The amount of occupied dataset memory after conversion: 7.7 Mb

<class 'pandas.core.frame.DataFrame'>
Int64Index: 243713 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   event   243713 non-null  object        
 1   id      243713 non-null  int64         
 2   ts      243713 non-null  datetime64[ns]
 3   gr      243713 non-null  uint8         
dtypes: datetime64[ns](1), int64(1), object(1), uint8(1)
memory usage: 7.7+ MB
# a separate date column is created
df_data['dt'] = df_data['ts'].dt.date
# total events in the data
df_data.event.value_counts().reset_index().rename({'index':'Event','event':'Number of events'}, axis=1)
|   | Event | Number of events |
|---|---|---|
| 0 | MainScreenAppear | 119101 |
| 1 | OffersScreenAppear | 46808 |
| 2 | CartScreenAppear | 42668 |
| 3 | PaymentScreenSuccessful | 34118 |
| 4 | Tutorial | 1018 |
# number of unique users
print(f"Number of unique users: {df_data.id.nunique()}")
Number of unique users: 7551
# median number of events per user
print("Median number of " +
      f"events per user: {df_data.pivot_table(index='id', aggfunc={'event':'count'}).median()[0]:.0f}")
Median number of events per user: 20
Intermediate conclusions
# date range in the data
print(f"The data is presented from {df_data.dt.min()} to {df_data.dt.max()}")
The data is presented from 2019-07-25 to 2019-08-07
# data preparation
temp = (
df_data
.pivot_table(index=df_data.ts.dt.round('1h'), aggfunc={'event':'count'})
.reset_index()
)
# graphics preparation
fig = px.line(temp, x="ts", y="event", title='Dynamics of recorded events')
fig.update_xaxes(title_text="Date")
fig.update_yaxes(title_text="Events number")
fig.show()
Intermediate conclusions
# days with incomplete data are cut off
df_data2 = df_data.query("ts >= '2019-08-01'")
print(f"Number of unique users after data clipping: {df_data2.id.nunique()}")
Number of unique users after data clipping: 7534
# estimation of data loss by users in the context of events
mf_diff('event')
| event | id_before | id_after | diff | lost_share_proc |
|---|---|---|---|---|
| CartScreenAppear | 3749 | 3734 | 15 | 0.4 |
| MainScreenAppear | 7439 | 7419 | 20 | 0.3 |
| OffersScreenAppear | 4613 | 4593 | 20 | 0.4 |
| PaymentScreenSuccessful | 3547 | 3539 | 8 | 0.2 |
| Tutorial | 847 | 840 | 7 | 0.8 |
| Total | 20195 | 20125 | 70 | 0.3 |
# estimation of data loss by users in the context of groups
mf_diff('gr')
| gr | id_before | id_after | diff | lost_share_proc |
|---|---|---|---|---|
| 246 | 2489 | 2484 | 5 | 0.2 |
| 247 | 2520 | 2513 | 7 | 0.3 |
| 248 | 2542 | 2537 | 5 | 0.2 |
| Total | 7551 | 7534 | 17 | 0.2 |
Intermediate conclusions
# data preparation
grd1 = (
df_data2
.pivot_table(index='event', aggfunc={'event':'count'})
.rename({'event':'count'}, axis=1)
.reset_index()
.sort_values(by='count', ascending=False)
)
# graphics preparation
fig=px.pie(grd1, names='event', values='count')
fig.update_layout(height=500, width=500, title_text="Shares of events in the total volume of all events")
fig.show()
mf_bar(grd1, 'event', 'count', 'Bar chart of the number of events', 'Events', 'Number of events')
Intermediate conclusions
# data preparation without Tutorial stage
df_data3 = df_data.query("ts >= '2019-08-01' and event != 'Tutorial'")
# counting the number of unique users at each step of the funnel
grd2 = (
df_data3.pivot_table(index='event', aggfunc={'id':'nunique'})
.reset_index()
.rename({'id':'count'}, axis=1)
.sort_values(by='count', ascending=False)
.reset_index(drop=True)
)
# counting conversions to the previous event
grd2['shift'] = (
grd2['count']
.shift(periods=1, axis=0)
)
grd2['conversion'] = grd2.apply(lambda x: round(x['count']*100/x['shift'],1), axis=1)
grd2['delta'] = grd2['shift'] - grd2['count']
grd2
|   | event | count | shift | conversion | delta |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 7419 | NaN | NaN | NaN |
| 1 | OffersScreenAppear | 4593 | 7419.0 | 61.9 | 2826.0 |
| 2 | CartScreenAppear | 3734 | 4593.0 | 81.3 | 859.0 |
| 3 | PaymentScreenSuccessful | 3539 | 3734.0 | 94.8 | 195.0 |
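Besides step-to-step conversion, the end-to-end conversion from the first screen is often useful. Both can be computed vectorized, without `apply`; a sketch on the counts from the table above:

```python
import pandas as pd

# funnel counts from the table above
grd2 = pd.DataFrame({
    'event': ['MainScreenAppear', 'OffersScreenAppear',
              'CartScreenAppear', 'PaymentScreenSuccessful'],
    'count': [7419, 4593, 3734, 3539],
})
# conversion to the previous step (as in the table)
grd2['conversion'] = (grd2['count'] / grd2['count'].shift(1) * 100).round(1)
# conversion relative to the first funnel step
grd2['total_conversion'] = (grd2['count'] / grd2['count'].iloc[0] * 100).round(1)
print(grd2)
```

Roughly 47.7% of users who saw the main screen reach a successful payment, which is the overall funnel conversion.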
# building a food sales funnel
fig = go.Figure(
go.Funnel(y=list(grd2.reset_index()['event']), x=list(grd2.reset_index()['count'])
))
fig.update_layout(height=500, width=950, title = "Food sales funnel on the website")
fig.update_yaxes(title_text="Funnel Steps")
fig.show()
Intermediate conclusions
# checking for the flow of users from group to group
print("The number of users who switched to other groups during the test: "+
f"{df_data3.pivot_table(index='id', aggfunc={'gr':'nunique'}).query('gr>1').count()[0]}")
The number of users who switched to other groups during the test: 0
# number of users participating in the A/B test
grd=df_data3.pivot_table(index='gr', aggfunc={'id':'nunique'}).reset_index()
mf_bar(grd, 'gr', 'id', 'Number of unique users by A/B-test group', 'Test group number', 'Number of users')
print("Checking the equality of the number of users by test groups:")
print(f"246/247: {grd.set_index('gr').loc[246, 'id']/grd.set_index('gr').loc[247, 'id']:.3f}")
print(f"247/248: {grd.set_index('gr').loc[247, 'id']/grd.set_index('gr').loc[248, 'id']:.3f}")
print(f"246/248: {grd.set_index('gr').loc[246, 'id']/grd.set_index('gr').loc[248, 'id']:.3f}")
Checking the equality of the number of users by test groups:
246/247: 0.988
247/248: 0.991
246/248: 0.979
Intermediate conclusions
# sales funnel by three groups
grd= (
df_data3
.pivot_table(index='event', columns='gr', aggfunc={'id':'nunique'})
.droplevel(level=0, axis=1)
.sort_values(by=246, ascending=False)
)
grd
| event | 246 | 247 | 248 |
|---|---|---|---|
| MainScreenAppear | 2450 | 2476 | 2493 |
| OffersScreenAppear | 1542 | 1520 | 1531 |
| CartScreenAppear | 1266 | 1238 | 1230 |
| PaymentScreenSuccessful | 1200 | 1158 | 1181 |
# building a segmented funnel
fig = go.Figure()
for i in grd.columns:
fig.add_trace(go.Funnel(
name = i,
y = list(grd.index),
x = list(grd.loc[:,i])
))
fig.update_layout(title = "Funnel of food sales on the website by A/B test groups")
fig.update_yaxes(title_text="Funnel Steps")
fig.show()
A/B test results analysis plan
# data preparation - the number of unique users for each group by funnel steps
df_t= df_data3.pivot_table(index='gr', columns='event', aggfunc={'id':'nunique'}).droplevel(level=0, axis=1)
# adding combined data by 246 and 247 groups and sorting by funnel steps
df_t = (
    pd.concat([
        df_t,
        df_data3.query("gr in (246,247)")
        .pivot_table(index='event', aggfunc={'id':'nunique'})
        .rename({'id':'246_247'}, axis=1)
        .T
    ])
    .T
    .sort_values(by=246, ascending=False)
    .T
)
# adding totals by groups
df_t['total'] = (
    pd.concat([
        df_data3.pivot_table(index='gr', aggfunc={'id':'nunique'}),
        pd.DataFrame({'id': [df_data3.query("gr in (246,247)")['id'].nunique()]}, index=['246_247'])
    ])
    .rename({'id':'total'}, axis=1)['total']
)
print("The number of unique users by groups and funnel steps:")
df_t
The number of unique users by groups and funnel steps:
| gr | MainScreenAppear | OffersScreenAppear | CartScreenAppear | PaymentScreenSuccessful | total |
|---|---|---|---|---|---|
| 246 | 2450 | 1542 | 1266 | 1200 | 2483 |
| 247 | 2476 | 1520 | 1238 | 1158 | 2512 |
| 248 | 2493 | 1531 | 1230 | 1181 | 2535 |
| 246_247 | 4926 | 3062 | 2504 | 2358 | 4995 |
# checking user shares by funnel steps by groups
print("Checking user shares by funnel steps by groups as a percentage:")
df_t.apply(lambda x: x*100/x['total'], axis=1)
Checking user shares by funnel steps by groups as a percentage:
| gr | MainScreenAppear | OffersScreenAppear | CartScreenAppear | PaymentScreenSuccessful | total |
|---|---|---|---|---|---|
| 246 | 98.670963 | 62.102296 | 50.986710 | 48.328635 | 100.0 |
| 247 | 98.566879 | 60.509554 | 49.283439 | 46.098726 | 100.0 |
| 248 | 98.343195 | 60.394477 | 48.520710 | 46.587771 | 100.0 |
| 246_247 | 98.618619 | 61.301301 | 50.130130 | 47.207207 | 100.0 |
# threshold level of stat significance
alpha = 0.05
# Šidák correction for 4 funnel steps and 4 group comparisons (16 tests)
alpha_sh = 1-(1-alpha)**(1/(4*4))
print(f"Threshold level of significance after the Šidák correction: {alpha_sh:.2%}")
Threshold level of significance after the Šidák correction: 0.32%
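The Šidák threshold is only slightly looser than the simpler Bonferroni one; a quick side-by-side comparison, assuming the same 16 tests as above:

```python
alpha = 0.05
m = 4 * 4  # 4 funnel steps x 4 group comparisons

alpha_sidak = 1 - (1 - alpha) ** (1 / m)  # per-test threshold, ~0.3201%
alpha_bonferroni = alpha / m              # per-test threshold, 0.3125%
print(f"Sidak: {alpha_sidak:.4%}, Bonferroni: {alpha_bonferroni:.4%}")
```

Šidák is exact when the tests are independent; Bonferroni is slightly more conservative but makes no independence assumption.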
# calculation of all stat tests
voc1={'246 to 247 (A1/A2)':[246, 247], '246 to 248 (A1/B)':[246, 248]
      , '247 to 248 (A2/B)':[247, 248], '246 & 247 to 248 (A12/B)':['246_247', 248]}
for j in voc1:
groups = voc1[j]
print(f"\n\n---- Groups comparison {j} by funnel steps:")
for i in df_t.columns:
if i !='total':
res = mf_z_test([df_t.loc[groups[0], i], df_t.loc[groups[1], i]]
, [df_t.loc[groups[0], 'total'], df_t.loc[groups[1], 'total']], alpha_sh)
            if res[2] == 'H0':
                text = f"the null hypothesis is not rejected (no significant difference between the shares), p-value = {res[0]:.1%}"
            else:
                text = f"the null hypothesis is rejected (the shares differ significantly), p-value = {res[0]:.1%}"
            print(f"{i}: \n "+text)
---- Groups comparison 246 to 247 (A1/A2) by funnel steps:
MainScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 75.3%
OffersScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 24.8%
CartScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 22.9%
PaymentScreenSuccessful:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 11.4%
---- Groups comparison 246 to 248 (A1/B) by funnel steps:
MainScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 33.9%
OffersScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 21.4%
CartScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 8.1%
PaymentScreenSuccessful:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 21.7%
---- Groups comparison 247 to 248 (A2/B) by funnel steps:
MainScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 51.9%
OffersScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 93.3%
CartScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 58.8%
PaymentScreenSuccessful:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 72.8%
---- Groups comparison 246 & 247 to 248 (A12/B) by funnel steps:
MainScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 34.9%
OffersScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 44.6%
CartScreenAppear:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 18.7%
PaymentScreenSuccessful:
the null hypothesis is not rejected (no significant difference between the shares), p-value = 61.1%
Intermediate conclusions